{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Transformations\n",
    "\n",
    "You won't always get data in a convienent format, often you will have to deal with data that is non-numerical, such as customer names, or zipcodes, country names, etc...\n",
    "\n",
    "A big part of working with data is using your own domain knowledge to build an intuition of how to deal with the data, sometimes the best course of action is to drop the data, other times feature-engineering is a good way to go, or you could try to transform the data into something the Machine Learning Algorithms will understand.\n",
    "\n",
    "Spark has several built in methods of dealing with thse transformations, check them all out here: http://spark.apache.org/docs/latest/ml-features.html\n",
    "\n",
    "Let's see some examples of all of this!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.sql import SparkSession"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "spark = SparkSession.builder.appName('data').getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df = spark.read.csv('fake_customers.csv',inferSchema=True,header=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+----------+-----+\n",
      "|   Name|     Phone|Group|\n",
      "+-------+----------+-----+\n",
      "|   John|4085552424|    A|\n",
      "|   Mike|3105552738|    B|\n",
      "| Cassie|4085552424|    B|\n",
      "|  Laura|3105552438|    B|\n",
      "|  Sarah|4085551234|    A|\n",
      "|  David|3105557463|    C|\n",
      "|   Zach|4085553987|    C|\n",
      "|  Kiera|3105552938|    A|\n",
      "|  Alexa|4085559467|    C|\n",
      "|Karissa|3105553475|    A|\n",
      "+-------+----------+-----+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Features\n",
    "\n",
    "### StringIndexer\n",
    "\n",
    "We often have to convert string information into numerical information as a categorical feature. This is easily done with the StringIndexer Method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+--------+-------------+\n",
      "|user_id|category|categoryIndex|\n",
      "+-------+--------+-------------+\n",
      "|      0|       a|          0.0|\n",
      "|      1|       b|          2.0|\n",
      "|      2|       c|          1.0|\n",
      "|      3|       a|          0.0|\n",
      "|      4|       a|          0.0|\n",
      "|      5|       c|          1.0|\n",
      "+-------+--------+-------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.ml.feature import StringIndexer\n",
    "\n",
    "df = spark.createDataFrame(\n",
    "    [(0, \"a\"), (1, \"b\"), (2, \"c\"), (3, \"a\"), (4, \"a\"), (5, \"c\")],\n",
    "    [\"user_id\", \"category\"])\n",
    "\n",
    "indexer = StringIndexer(inputCol=\"category\", outputCol=\"categoryIndex\")\n",
    "indexed = indexer.fit(df).transform(df)\n",
    "indexed.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next step would be to encode these categories into \"dummy\" variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### VectorIndexer\n",
    "\n",
    "VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees. VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns will be concatenated into a vector in the specified order. \n",
    "\n",
    "Assume that we have a DataFrame with the columns id, hour, mobile, userFeatures, and clicked:\n",
    "\n",
    "     id | hour | mobile | userFeatures     | clicked\n",
    "    ----|------|--------|------------------|---------\n",
    "     0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0\n",
    "     \n",
    "userFeatures is a vector column that contains three user features. We want to combine hour, mobile, and userFeatures into a single feature vector called features and use it to predict clicked or not. If we set VectorAssembler’s input columns to hour, mobile, and userFeatures and output column to features, after transformation we should get the following DataFrame:\n",
    "\n",
    "     id | hour | mobile | userFeatures     | clicked | features\n",
    "    ----|------|--------|------------------|---------|-----------------------------\n",
    "     0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0     | [18.0, 1.0, 0.0, 10.0, 0.5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---+----+------+--------------+-------+\n",
      "| id|hour|mobile|  userFeatures|clicked|\n",
      "+---+----+------+--------------+-------+\n",
      "|  0|  18|   1.0|[0.0,10.0,0.5]|    1.0|\n",
      "+---+----+------+--------------+-------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.ml.linalg import Vectors\n",
    "from pyspark.ml.feature import VectorAssembler\n",
    "\n",
    "dataset = spark.createDataFrame(\n",
    "    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],\n",
    "    [\"id\", \"hour\", \"mobile\", \"userFeatures\", \"clicked\"])\n",
    "dataset.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'\n",
      "+--------------------+-------+\n",
      "|            features|clicked|\n",
      "+--------------------+-------+\n",
      "|[18.0,1.0,0.0,10....|    1.0|\n",
      "+--------------------+-------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "assembler = VectorAssembler(\n",
    "    inputCols=[\"hour\", \"mobile\", \"userFeatures\"],\n",
    "    outputCol=\"features\")\n",
    "\n",
    "output = assembler.transform(dataset)\n",
    "print(\"Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'\")\n",
    "output.select(\"features\", \"clicked\").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There ar emany more data transformations available, we will cover them once we encounter a need for them, for now these were the most important ones.\n",
    "\n",
    "Let's continue on to Linear Regression!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
